Abstract

Natural Language Processing (NLP) systems commonly
leverage bag-of-words co-occurrence techniques to capture
semantic and syntactic word relationships. The resulting
word-level distributed representations often ignore
morphological information, though character-level
embeddings have proven valuable to NLP tasks. We propose
a new neural language model incorporating both word order
and character order in its embedding. The model produces
several vector spaces with meaningful substructure, as
evidenced by its performance of 85.8% on a recent
word-analogy task, exceeding best published syntactic
word-analogy scores by a 58% error margin (Pennington
et al., 2014). Furthermore, the model includes several
parallel training methods, most notably allowing a
skip-gram network with 160 billion parameters to be
trained overnight on 3 multi-core CPUs, 14x larger than
the previous largest neural network (Coates et al., 2013).


1. Introduction 

NLP systems seek to automate the extraction of useful
information from sequences of symbols in human language.
These systems encounter difficulty due to the complexity
and sparsity in natural language. Traditional systems have
represented words as atomic units with success in a variety
of tasks (Katz, 1987). This approach is limited by the curse
of dimensionality and has been outperformed by neural
network language models (NNLM) in a variety of tasks
(Bengio et al., 2003; Morin & Bengio, 2005; Mnih & Hinton,
2009). NNLMs overcome the curse of dimensionality by
learning distributed representations for words (G.E. Hinton,
1986; Bengio et al., 2003). Specifically, neural language
models embed a vocabulary into a smaller dimensional
linear space that models “the probability function for word
sequences, expressed in terms of these representations”
(Bengio et al., 2003). The result is a vector space model
(Maas & Ng, 2010) that encodes semantic and syntactic
relationships and has defined a new standard for feature
generation in NLP (Manning et al., 2008; Sebastiani,
2002; Turian et al., 2010).

NNLMs generate word embeddings by training a symbol
prediction task over a moving local-context window such as
predicting a word given its surrounding context (Mikolov
et al., 2013a;b). This work follows from the distributional
hypothesis: words that appear in similar contexts have
similar meaning (Harris). Words that appear in similar
contexts will experience similar training examples and
training outcomes, and will converge to similar weights.
The ordered set of weights associated with each word
becomes that word’s dense vector embedding. These
distributed representations encode shades of meaning
across their dimensions, allowing for two words to have
multiple, real-valued relationships encoded in a single
representation (Liang & Potts, 2015).

(Mikolov et al., 2013c) introduced a new property of word
embeddings based on word analogies such that vector
operations between words mirror their semantic and
syntactic relationships. The analogy “king is to queen as
man is to woman” can be encoded in vector space by the
equation king - queen = man - woman. A dataset of these
analogies, the Google Analogy Dataset¹, is divided into two
broad categories, semantic queries and syntactic queries.
Semantic queries identify relationships such as “France is
to Paris as England is to London” whereas syntactic queries
identify relationships such as “running is to run as pruning
is to prune”. This is a standard by which distributed word
embeddings may be evaluated.

¹ http://word2vec.googlecode.com/svn/trunk/
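
As a minimal sketch of how such an analogy query can be answered, the snippet below rearranges king - queen = man - woman into queen ≈ king - man + woman and returns the nearest remaining word by cosine similarity. The three-dimensional vectors and the `analogy` helper are hand-made assumptions for illustration only; they are not trained embeddings and not the evaluation script shipped with word2vec.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' with the word nearest to b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The query words themselves are excluded from the candidate answers.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Toy vectors chosen by hand (hypothetical, not learned).
TOY_VECTORS = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "paris": [0.5, 0.5, 0.5],
}

answer = analogy("man", "woman", "king", TOY_VECTORS)
```

With these toy values the target vector coincides with the vector for “queen”, so `answer` is that word; with real embeddings the match is only approximate, which is why a nearest-neighbour search is used.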

Until recently, NNLMs have ignored morphology and word
shape. However, including information about word structure
in word representations has proven valuable for part
of speech analysis (Santos & Zadrozny, 2014), word
similarity (Luong et al., 2013), and information extraction
(Qi et al., 2014).

We propose a neural network architecture that explicitly
encodes order in a sequence of symbols and use this
architecture to embed both word-level and character-level
representations. When these two representations are
concatenated, the resulting representations exceed best
published results in both the semantic and syntactic
evaluations of the Google Analogy Dataset.

2. Related Work 

2.1. Word-level Representations (Word2vec) 

Our technique is inspired by recent work in learning
vector representations of words, phrases, and sentences
using neural networks (Mikolov et al., 2013a;b; Le &
Mikolov, 2014). In the CBOW configuration of the negative
sampling training method by (Mikolov et al., 2013a), each
word is represented by a row-vector in matrix syn0 and is
concatenated, summed, or averaged with other word vectors
in a context window. The resulting vector is used in a
classifier syn1 to predict the existence of the whole
context with the focus term (positive training) or the
absence of other randomly sampled words in the window
(negative sampling). The scalar output is passed through a
sigmoid function ($\sigma(z) = (1 + e^{-z})^{-1}$),
returning the network’s probability that the removed word
exists in the middle of the window, without stipulation on
the order of the context words. This optimizes the
following objective:


$$\arg\max_{\theta} \prod_{(w,C) \in d} p(w = 1 \mid C; \theta) \prod_{(w,C) \in d'} p(w = 0 \mid C; \theta) \tag{1}$$


where $d$ represents the document as a collection of
context-word pairs $(w, C)$, $C$ is an unordered group of
words in a context window, and $d'$ is a set of random
$(w, C)$ pairs. $\theta$ will be adjusted such that
$p(w = 1 \mid C; \theta) = 1$ for context-word pairs that
exist in $d$, and 0 for the random context-word pairs in
$d'$. In the skip-gram negative sampling work by (Mikolov
et al., 2013a;b), each word in a context is trained in
succession. This optimizes the following objective:


$$\arg\max_{\theta} \prod_{(w,c) \in d} p(w = 1 \mid c; \theta) \prod_{(w,c) \in d'} p(w = 0 \mid c; \theta) \tag{2}$$

where $d$ represents the document as a collection of
context-word pairs $(w, c)$ and $c$ represents a single
word in the context. Because this models an element-wise
probability that a word occurs given one other word in the
context, (2) is equivalent to the skip-gram objective
outlined in (Mikolov et al., 2013b; Goldberg & Levy, 2014).
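
To make the CBOW term in the objectives above concrete, the following sketch computes the sigmoid probability for one context-word pair when the context vectors are summed. The dictionaries `syn0` and `syn1` hold hand-picked toy values (an assumption; in practice both are learned), and the function name is illustrative rather than taken from word2vec.

```python
import math


def sigmoid(z):
    """sigma(z) = 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + math.exp(-z))


def cbow_probability(context_words, focus_word, syn0, syn1):
    """Probability that focus_word fills the gap, with summed context vectors."""
    dim = len(syn1[focus_word])
    hidden = [0.0] * dim
    for word in context_words:
        for k, value in enumerate(syn0[word]):
            hidden[k] += value  # summation discards word order
    score = sum(h * w for h, w in zip(hidden, syn1[focus_word]))
    return sigmoid(score)


# Toy weights (hypothetical values, for illustration only).
syn0 = {"see": [0.1, 0.3], "run": [0.2, -0.1]}
syn1 = {"spot": [0.5, 0.4], "dog": [-0.3, 0.2]}

p_forward = cbow_probability(["see", "run"], "spot", syn0, syn1)
p_reversed = cbow_probability(["run", "see"], "spot", syn0, syn1)
```

Because the context vectors are summed, `p_forward` and `p_reversed` are identical: the bag-of-words input cannot distinguish “SEE _ RUN” from “RUN _ SEE”.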

Reducing the window size under these models constrains
the probabilities to be more localized, because the
probability that two words co-occur falls as the window
shrinks. This can be advantageous for words whose
statistically significant co-occurrences lie within a
short window. For example, currency symbols often co-occur
with numbers within a small window. Outside of a small
window, currency symbols and numbers are not likely to
co-occur. Thus, reducing the window size reduces noise in
the prediction. Words such as city names, however, prefer
wider windows to encode broader co-occurrence statistics
with other words such as landmarks, street-names, and
cultural words which could be farther away in the document.



Figure 1. Diagram of word2vec’s Continuous Bag of Words
training method over the sentence “SEE SPOT RUN”.
Embeddings for “SEE” and “RUN” are summed into a third
vector that is used to predict the probability that the
middle word is “SPOT”.

Neither skip-gram nor CBOW explicitly preserves word
order in its word embeddings (Mikolov et al., 2013a;b; Le
& Mikolov, 2014). Ordered concatenation of syn0 vectors
does embed order in syn1, but this is obfuscated by the
fact that the same embedding for each word must be linearly
compatible with the feature detectors in every window
position. In addition to changing the objective function,
this has the effect of cancelling out features that are
unique to only one window position by those in other
window positions that are attempting to be encoded in the
same feature detector dimension. This effect prevents word
embeddings from preserving order-based features. The other
methods (sum, average, and skip-gram) ignore order
completely and model only co-occurrence based probability
in their embeddings.

2.2. Character-level Representations 

Recent work has explored techniques to embed word shape
and morphology features into word embeddings. The
resulting embeddings have proven useful for a variety of
NLP tasks.

2.2.1. Deep Neural Network 

(Santos & Zadrozny, 2014) proposed a Deep Neural Network
(DNN) that “learns character-level representation[s] of
words and associate[s] them with usual word
representations to perform POS tagging.” The resulting
embeddings were used to produce state-of-the-art POS
taggers for both English and Portuguese data sets. The
network architecture leverages the convolutional approach
introduced in (Waibel et al., 1990) to produce local
features around each character of the word and then
combines them into a fixed-sized character-level embedding
of the word. The character-level word embedding is then
concatenated with a word-level embedding learned using
word2vec. Using only these embeddings, (Santos &
Zadrozny, 2014) achieves state-of-the-art results in POS
tagging without the use of hand-engineered features.

2.2.2. Recursive Neural Network 

(Luong et al., 2013) proposed a “novel model that is
capable of building representations for morphologically
complex words from their morphemes.” The model leverages
a recursive neural network (RNN) (Socher et al., 2011)
to model morphology in a word embedding. Words are
decomposed into morphemes using a morphological
segmenter (Creutz & Lagus, 2007). Using the “morphemic
vectors”, word-level representations are constructed for
complex words. In the experiments performed by (Luong
et al., 2013), word embeddings were borrowed from
(Huang et al., 2012) and (Collobert et al., 2011). After
conducting a morphemic segmentation, complex words were
then enhanced with morphological feature embeddings by
using the morphemic vectors in the RNN to compute word
representations “on the fly”. The resulting model
outperforms existing embeddings on word similarity tasks
across several data sets.


3. The Partitioned Embedding Neural 
Network Model (PENN) 



Figure 2. The Windowed configuration of PENN when using the 
CLOW training method modeling “SEE SPOT RUN”. 

We propose a new neural language model called a
Partitioned Embedding Neural Network (PENN). PENN
improves upon word2vec by modeling the order in which
words occur. It models order by partitioning both the
embedding and classifier layers. There are two styles of
training corresponding to the CBOW negative sampling
and skip-gram negative sampling methods in word2vec,
although they differ in key areas.

The first property of PENN is that each word embedding
is partitioned. Each partition is trained differently from
every other partition based on word order, such that each
partition models a different probability distribution. These
different probability distributions model different
perspectives on the same word. The second property of PENN
is that the classifier has different inputs for words from
different window positions. The classifier is partitioned
with the same partition dimensionality as the embedding. It
is possible to have fewer partitions in the classifier than
in the embedding, such that a greater number of word
embeddings are summed or averaged into fewer classifier
partitions. This configuration performs better when using
smaller dimensionality feature vectors with large windows,
as it balances the (embedding partition size) / (window
size) ratio. The following subsection presents the two
opposite configurations under the PENN framework.

3.1. Plausible Configurations 

3.1.1. Windowed 

The simplest configuration of a PENN architecture is the
windowed configuration, where each partition corresponds
to a unique window position in which a word occurs. As
illustrated in Figure 2, if there are two window positions
(one on each side of the focus term), then each embedding
would have two partitions. When a word occupies window
position p = +1 (the word before the focus term), the
partition corresponding to that position is propagated
forward and subsequently receives the back-propagated
update, while the p = -1 partition remains unchanged.

3.1.2. Directional 

The opposite configuration to windowed PENN is the
directional configuration. Instead of each partition
corresponding to a window position, there are only two
partitions. One partition corresponds to every positive,
forward predicting window position (left of the focus term)
and the other partition corresponds to every negative,
backward predicting window position (right of the focus
term). For each partition, all embeddings corresponding to
that partition are summed or averaged when being propagated
forward.
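
The difference between the two configurations can be sketched as follows. This is an illustrative reading of the description above, not the authors’ implementation: the data layout (one partition per signed window position for the windowed case, one "forward" and one "backward" partition for the directional case), the function names, and all numeric values are assumptions.

```python
def windowed_input(window, focus_index, embeddings):
    """Windowed PENN: concatenate one position-specific partition per context word."""
    parts = []
    for idx, word in enumerate(window):
        if idx == focus_index:
            continue
        position = focus_index - idx  # +1 is the word just before the focus term
        parts.extend(embeddings[word][position])
    return parts


def directional_input(window, focus_index, embeddings):
    """Directional PENN: average one partition per side of the focus term."""
    dim = len(embeddings[window[focus_index]]["forward"])
    sides = (("forward", window[:focus_index]),        # left of the focus term
             ("backward", window[focus_index + 1:]))   # right of the focus term
    result = []
    for side, words in sides:
        total = [0.0] * dim
        for word in words:
            for k, value in enumerate(embeddings[word][side]):
                total[k] += value
        count = max(len(words), 1)  # avoid dividing by zero on an empty side
        result.extend(value / count for value in total)
    return result


# Toy partitions (hypothetical; real partitions are learned).
WINDOWED = {
    "see": {+1: [1.0, 0.0], -1: [0.0, 1.0]},
    "run": {+1: [2.0, 0.0], -1: [0.0, 2.0]},
}
DIRECTIONAL = {
    "see": {"forward": [1.0, 0.0], "backward": [0.0, 1.0]},
    "spot": {"forward": [3.0, 0.0], "backward": [0.0, 3.0]},
    "run": {"forward": [2.0, 0.0], "backward": [0.0, 2.0]},
    "past": {"forward": [4.0, 0.0], "backward": [0.0, 4.0]},
}

see_spot_run = windowed_input(["see", "spot", "run"], 1, WINDOWED)
run_spot_see = windowed_input(["run", "spot", "see"], 1, WINDOWED)
see_spot_run_past = directional_input(["see", "spot", "run", "past"], 1, DIRECTIONAL)
```

Unlike a summed bag-of-words input, `see_spot_run` and `run_spot_see` differ, because each word contributes the partition tied to its position. The directional input has a fixed width of two partitions regardless of window size.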



Figure 3. The Directional configuration of PENN when using the 
CLOW training method. It is modeling the sentence “SEE SPOT 
RUN PAST”. 

3.2. Training Styles 

3.2.1. Continuous List of Words (CLOW) 

The Continuous List of Words (CLOW) training style
under the PENN framework optimizes the following objective
function:

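$$\arg\max_{\theta} \Big( \prod_{(w,C) \in d} \; \prod_{-c \le j \le c,\, j \ne 0} p(w = 1 \mid c_j^j; \theta) \prod_{(w,C) \in d'} \; \prod_{-c \le j \le c,\, j \ne 0} p(w = 0 \mid c_j^j; \theta) \Big)$$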


where $c_j^j$ is the location specific representation
(partition $j$) for the word at window position $j$
relative to the focus word $w$. Closely related to the
CBOW training method, the CLOW method models the
probability that in an ordered list of words, a specific
word is present in the middle of the list, given the
presence and location of the other words. For each training
example out of a windowed sequence of words, the middle
focus term is removed. Then, a partition is selected from
each remaining word’s embedding based on that word’s
position relative to the focus term. These partitions are
concatenated and propagated through the classifier layer.
All weights are updated to model the probability that the
presence of the focus term is 100% (positive training) and
that of other randomly sampled words is 0%
(negative sampling).
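
One CLOW update can be sketched as below. The paper does not spell out the gradient step, so this assumes the usual word2vec-style negative-sampling update (step = learning rate × (label - probability)); the function name `clow_step`, the learning rate, and all weights are hypothetical. Taking a dot product per position and summing is equivalent to concatenating the selected partitions and applying the concatenated classifier row.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def clow_step(context, target, label, emb, clf, lr=0.1):
    """One CLOW update; returns the probability computed before the update.

    context: list of (word, position) pairs around the removed focus term.
    label:   1 for the true focus term, 0 for a randomly sampled word.
    """
    score = 0.0
    for word, pos in context:
        score += sum(e * c for e, c in zip(emb[word][pos], clf[target][pos]))
    prob = sigmoid(score)
    step = lr * (label - prob)
    for word, pos in context:
        e, c = emb[word][pos], clf[target][pos]
        for k in range(len(e)):
            e_old = e[k]
            e[k] += step * c[k]   # only the partition for this position moves
            c[k] += step * e_old
    return prob


# Toy weights (hypothetical): emb[word][position], clf[target][position].
emb = {
    "see": {+1: [0.1, 0.2], -1: [0.3, 0.1]},
    "run": {+1: [0.2, 0.1], -1: [0.1, 0.3]},
}
clf = {
    "spot": {+1: [0.2, 0.1], -1: [0.1, 0.2]},
    "past": {+1: [0.1, 0.1], -1: [0.2, 0.2]},
}
context = [("see", +1), ("run", -1)]  # "SEE _ RUN"

first = clow_step(context, "spot", 1, emb, clf)   # positive training
clow_step(context, "past", 0, emb, clf)           # one negative sample
second = clow_step(context, "spot", 1, emb, clf)  # probability has risen
```

After these steps the partitions that were not selected, such as the -1 partition of “see”, still hold their initial values, which mirrors the description of untouched partitions in the windowed configuration.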

3.2.2. Skip-Gram 

The skip-gram training style under the PENN framework 
optimizes the following objective function 

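$$\arg\max_{\theta} \Big( \prod_{(w,C) \in d} \; \prod_{-c \le j \le c,\, j \ne 0} p(w_j = 1 \mid c_j^j; \theta) \prod_{(w,C) \in d'} \; \prod_{-c \le j \le c,\, j \ne 0} p(w_j = 0 \mid c_j^j; \theta) \Big)$$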

where, like CLOW, $c_j^j$ is the location specific
representation (partition $j$) for the word at window
position $j$ relative to the focus word $w$. $w_j$ is the
relative location specific probability (partition) of the
focus term. PENN skip-gram is almost identical to the CLOW
method with one key difference. Instead of each partition
of a word being concatenated with partitions from
neighboring words, each partition is fed forward and back
propagated in isolation. This models the probability that,
given a single word, the focus term is present a relative
number of words away in a given direction. This captures
information lost in the word2vec skip-gram architecture by
modeling based on the relative location of a context word
in the window as opposed to an arbitrary location within
the window.

The intuition behind modeling w and c based on j at the
same time becomes clear when considering the neural
architecture of these embeddings. Partitioning the context
word into j partitions gives a location specific
representation for a word’s relative position. Location
specific representations are important even for words with
singular meanings. Consider the word “going”, a word of
singular meaning. This word’s effect on a task predicting a
word immediately before it is completely different from its
effect on predicting a word immediately after it. The
phrase “am going” is a plausible phrase. The phrase “going
am” is not. Thus, forcing this word to have a consistent
embedding across these tasks forces it to convey identical
information while optimizing for non-identical problems.

Partitioning the classifier incorporates this same principle
with respect to the focus word. The focus word will read
features presented to it in a different light with a
different weighting given its position. For example,
“dollars” is far more likely to be predicted accurately
based on the word before it, whereas it is not likely to be
predicted correctly by a word ten window positions after.
Thus, the classifier responsible for looking for features
indicating that “dollars” is next should not have to be the
same classifier that looks for features ten window
positions into the future. Training separate classifier
partitions based on window position avoids this phenomenon.

3.3. Distributed Training Optimizations 

3.3.1. Skip-Gram 

When skip-gram is used to model ordered sets of words
under the PENN framework, each classifier partition and
its associated embedding partitions may be trained fully in
parallel (with no inter-communication) and reach the exact
same state as if they were not distributed. A special case
of this is the windowed embedding configuration, where
every window position can be trained fully in parallel and
concatenated (embeddings and classifiers) at the end of
training. This allows very large, rich embeddings to be
trained on relatively small, inexpensive machines in a
small amount of time, with each machine optimizing a part
of the overall objective function. Given machine j,
training skip-gram under the windowed embedding
configuration optimizes the following objective function:

$$\arg\max_{\theta} \Big( \prod_{(w,C) \in d} p(w_j = 1 \mid c_j^j; \theta) \prod_{(w,C) \in d'} p(w_j = 0 \mid c_j^j; \theta) \Big)$$

Concatenation of the weight matrices syn0 and syn1 then
incorporates the sum over j back into the PENN skip-gram
objective function during the forward propagation process,
yielding identical training results as a network trained in
a single-threaded, single-model PENN skip-gram fashion.
This training style achieves parity training results with
current state-of-the-art methods while training in parallel
over as many as j separate machines.
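
The identity that concatenation relies on can be checked with a small sketch: a dot product over concatenated partitions equals the sum of the per-partition dot products, so scores computed from partitions held separately (for example, one window position per machine) add up to the score of the concatenated model. The partition values below are hypothetical and chosen to be exactly representable in floating point.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def concat(partitions):
    """Flatten an iterable of partitions into one vector."""
    return [x for part in partitions for x in part]


# Hypothetical partitions, keyed by window position j (one per machine).
embedding_parts = {-2: [0.5, 1.0], -1: [1.0, 2.0], +1: [2.0, 0.5], +2: [1.5, 1.0]}
classifier_parts = {-2: [1.0, 0.5], -1: [0.5, 0.5], +1: [2.0, 1.0], +2: [1.0, 1.0]}

positions = sorted(embedding_parts)

# Score computed independently for each position.
per_machine_scores = [dot(embedding_parts[j], classifier_parts[j])
                      for j in positions]

# Score of the concatenated embedding against the concatenated classifier.
full_score = dot(concat(embedding_parts[j] for j in positions),
                 concat(classifier_parts[j] for j in positions))
```

Here `full_score` equals `sum(per_machine_scores)`. This only illustrates the forward-pass identity; the claim that independently trained partitions reach the same state as joint training rests on each skip-gram partition being updated in isolation, as described above.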

3.3.2. CLOW 

The CLOW method is an excellent candidate for the
ALOPEX distributed training algorithm (Unnikrishnan &
Venugopal, 1994) because it trains on very few (often
single) output probabilities at a time. Different classifier
partitions may be trained on different machines, with each
training example sending a short list of floats per machine
across the network. They all share the same global error
and continue on to the next iteration.

A second, nontrivial optimization is found in the strong
performance of the directional CLOW implementation with
very small window sizes (pictured below with a window
size of 1). Directional CLOW is able to achieve a parity
score using a window size of 1, contrasted with word2vec
using a window size of 10 when all other parameters are
equal, reducing the overall training time by a factor of 10.

4. Dense Interpolated Embedding Model 
Table 1. A focus character and the 4 closest characters ordered
by cosine similarity.


SEMANTIC                     SYNTACTIC

“general” - similarity
secretary         0.619      gneral         0.986
elections         0.563      genral         0.978
motors            0.535      generally      0.954
undersecretary    0.534      generation     0.944

“sees” - “see” + “bank” =
firestone         0.580      banks          0.970
yard              0.545      bank           0.939
peres             0.506      balks          0.914
c.c               0.500      bans           0.895

Table 2. An example of syntactic vs semantic embeddings on the 
cosine similarity and word-analogy tasks. 

We propose a second new neural language model called a
Dense Interpolated Embedding Model (DIEM). DIEM uses
neural embeddings learned at the character level to generate
a fixed-length syntactic embedding at the word level that is
useful for syntactic word-analogy tasks, leveraging patterns
in the characters as a human might when detecting syntactic
features such as plurality.

4.1. Method 

Generating syntactic embeddings begins by generating
character embeddings. Character embeddings are generated
using vanilla word2vec by predicting a focus character
given its context. This clusters characters in an intuitive
way: vowels with vowels, numbers with numbers, and
capitals with capitals. In this way, character embeddings
represent morphological building blocks that are more or
less similar to each other, based on how they have been
used.

Algorithm 1 Dense Interpolated Embedding Pseudocode
Input: word length l, list of char embeddings (e.g. the
    word) char_i, multiple M, char dim C, vector v_m
for i = 0 to l - 1 do
    s = M * i / l
    for m = 0 to M - 1 do
        d = pow(1 - (abs(s - m)) / M, 2)
        v_m = v_m + d * char_i
    end for
end for

Once character embeddings have been generated,
interpolation may begin over a word of length l. The final
embedding size must be selected as a multiple M of the
character embedding dimensionality C. For each character in
a word, its index i is first scaled linearly with the size
of the final “syntactic” embedding such that s = M * i / l.
Then, for each length C position m (out of M positions) in
the final word embedding v_m, a squared distance is
calculated relative to the scaled index such that distance
d = pow(1 - (abs(s - m)) / M, 2). The character vector for
the character at position i in the word is then scaled by d
and added elementwise into position m of vector v.

A more efficient form of this process caches a set of
transformation matrices, which hold the values of
$d_{i,m}$ for words of varying size. These matrices are
used to transform variable length concatenated character
vectors into fixed length word embeddings via
vector-matrix multiplication.
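
The interpolation and its cached variant can be sketched in Python as follows. `diem_embed` follows Algorithm 1 directly; `diem_embed_cached` stores one weight matrix per word length, which is one plausible reading of the cached transformation matrices. The `toy_char_vectors` helper produces placeholder 2-dimensional character vectors (an assumption for illustration; the paper trains character vectors with word2vec), and words are assumed to be non-empty.

```python
def interpolation_weights(length, M):
    """Weights d[i][m] = (1 - |s - m| / M)^2 with s = M * i / length."""
    return [[(1 - abs(M * i / length - m) / M) ** 2 for m in range(M)]
            for i in range(length)]


def diem_embed(char_vectors, M):
    """Direct form of Algorithm 1: returns a vector of length M * C."""
    length, C = len(char_vectors), len(char_vectors[0])
    slots = [[0.0] * C for _ in range(M)]
    for i, char_vec in enumerate(char_vectors):
        s = M * i / length
        for m in range(M):
            d = (1 - abs(s - m) / M) ** 2
            for k in range(C):
                slots[m][k] += d * char_vec[k]
    return [x for slot in slots for x in slot]


def diem_embed_cached(char_vectors, M, cache):
    """Same result, reusing one cached weight matrix per word length."""
    length, C = len(char_vectors), len(char_vectors[0])
    if length not in cache:
        cache[length] = interpolation_weights(length, M)
    weights = cache[length]
    return [sum(weights[i][m] * char_vectors[i][k] for i in range(length))
            for m in range(M) for k in range(C)]


def toy_char_vectors(word):
    """Placeholder character vectors (hypothetical, not trained)."""
    return [[(ord(ch) % 7) / 7.0, (ord(ch) % 11) / 11.0] for ch in word]


cache = {}
bank = diem_embed(toy_char_vectors("bank"), M=3)
banks = diem_embed(toy_char_vectors("banks"), M=3)
banks_cached = diem_embed_cached(toy_char_vectors("banks"), 3, cache)
```

Both “bank” and “banks” map to vectors of the same length (M * C = 6) despite their different character counts, and the cached variant reproduces the direct computation.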

These embeddings are useful for a variety of tasks,
including syntactic word-analogy queries. Furthermore, they
are useful for syntactic query expansion, mapping sparse
edge cases of a word (typos, odd capitalization, etc.) to a
more common word and its semantic embedding.

4.2. Distributed Use and Storage Optimizations 

Syntactic vectors also provide significant scaling and
generalization advantages over semantic vectors. New
syntactic vectors may be inexpensively generated for words
never before seen, giving lossless generalization to any
word from initial character training, assuming only that the
word is made up of characters that have been seen.
Syntactic embeddings can be generated in a fully distributed
fashion and only require a small vector concatenation and
vector-matrix multiplication per word. Secondly, the
character vectors (typically length 32) and transformation
matrices (at most 20 or so of them) can be stored very
efficiently relative to the semantic vocabularies, which can
be several million vectors of dimensionality 1000 or more.
Despite their significant positive impact on quality, DIEM
optimally performs using 6+ orders of magnitude less
storage space, and 5+ orders of magnitude fewer training
examples than word-level semantic embeddings.

5. Experiments 

5.1. Evaluation Methods 

We conduct experiments on the word-analogy task of
(Mikolov et al., 2013a). It is made up of a variety of
word similarity tasks, as described in (Luong et al., 2013).
Known as the Google Analogy Dataset, it contains 19,544
questions asking “a is to b as c is to ?” and is split into
semantic and syntactic sections. Both sections are further
divided into subcategories based on analogy type, as
indicated in the results tables below.

All training occurs over the dataset available from the
Google word2vec website², using the packaged
word-analogy evaluation script. The dataset contains
approximately 8 billion words collected from English News
Crawl, 1-Billion-Word Benchmark, UMBC Webbase, and English
Wikipedia. The dataset used leverages the default
data-phrase2.txt normalization in all training, which
includes both single tokens and phrases. Unless otherwise
specified, all parameters for training and evaluating are
identical to the default parameters specified in the default
word2vec big model, which is freely available online.

5.2. Embedding Partition Relative Evaluation 

Figure 4 displays the relative accuracy of each partition
in a PENN model as judged by row-relative word-analogy
scores. Other experiments indicated that the pattern present
in the heat-map is consistent across parameter tunings.
There is a clear quality difference between window
positions that predict forward (left side of the figure) and
window positions that predict backward (right side of the
figure). “currency” achieves most of its predictive power in
short range predictions, whereas “capital-common
countries” is a much smoother gradient over the window.
These patterns support the intuition that different window
positions play different roles in different tasks.

5.3. Evaluation of CLOW and CBOW 

Table 3 shows the performance of the default CBOW
implementation of word2vec relative to CLOW and DIEM
when configured to 2000 dimensional embeddings. Between
Tables 3 and 4, we see that increasing the dimensionality of
baseline CBOW word2vec past 500 achieves suboptimal
performance. Thus, a fair comparison of two models should
be between optimal (as opposed to just identical)
parameterization for each model. This is especially
important given that PENN models are modeling a much
richer probability distribution, given that order is being
preserved. Thus, optimal parameter settings often require
larger dimensionality.

Unlike the original CBOW word2vec, we have found that a
bigger window size is not always better. Larger windows
tend to create slightly more semantic embeddings, whereas
smaller window sizes tend to create slightly more syntactic
embeddings. This follows the intuition that syntax plays a
huge role in grammar, which is dictated by rules about which
words make sense to occur immediately next to each other.
Words that are 5+ words apart cluster based on subject
matter and semantics as opposed to grammar.

With respect to window size and overall quality, because
partitions slice up the global vector for a word, increasing
the window size decreases the size of each partition in the
window if the global vector size remains constant. Since
each embedding is attempting to model a very complex
(hundreds of thousands of words) probability distribution,
the size of each partition must remain high enough to model
this distribution. Thus, modeling large windows for semantic
embeddings is optimal when using either the directional
embedding model, which always has exactly two partitions,
or a large global vector size. The directional model with
optimal parameters has slightly less quality than the
windowed model with optimal parameters due to the vector
averaging occurring in each window pane.

² https://code.google.com/p/word2vec/

Figure 4. Green represents the highest quality partition. Red indicates the lowest. Gray indicates the gradient performance between red
and green. Two greens in the same row indicate a tie within a 1% margin.


5.4. Evaluation of DIEM Syntactic Vectors on 
Syntactic Tasks 


Semantic Architecture      CBOW     CLOW     DIEM
Semantic Vector Dim.        500      500      500

SEMANTIC TOTAL            81.02    80.19    80.19

adjective-to-adverb       37.70    35.08    94.55
opposite                  36.21    40.15    74.60
comparative               86.71    87.31    92.49
superlative               80.12    82.00    87.61
present-participle        77.27    80.78    93.27
nationality-adjective     90.43    90.18    71.04
past-tense                72.37    73.40    47.56
plural                    80.18    81.83    93.69
plural-verbs              58.51    63.68    95.97
SYNTACTIC TOTAL           72.04    73.45    81.53

COMBINED SCORE            76.08    76.49    80.93


Table 4. Above we can observe the boost that syntactic-based
DIEM feature vectors give our unsupervised semantic models,
relative to both word2vec-CBOW and CLOW.

Configuration Style        W2V      Window   (see tbl. 5)
Training Style             CBOW     CLOW     ENSEM
Word Vector Size           2000     2000     7820
Partition Size             2000     500      (see tbl. 5)
Window Size                10       2        (see tbl. 5)

capital-common            85.18    98.81    95.65
capital-world             75.38    90.01    93.90
currency                   0.40    16.89    17.32
city-in-state             65.18    78.31    78.88
family                    49.01    84.39    85.35
SEMANTIC                  65.11    80.62    82.70

adjective-to-adverb       15.62    30.04    90.73
opposite                   8.50    38.55    73.15
comparative               51.95    94.37    99.70
superlative               33.87    79.77    91.89
present-participle        45.45    81.82    93.66
nationality-adjective     88.56    89.38    91.43
past-tense                55.19    76.99    60.01
plural                    73.05    83.93    97.90
plural-verbs              28.74    73.33    95.86
SYNTACTIC                 49.42    75.11    88.29

TOTAL                     56.49    77.59    85.77

Table 3. Comparison between Word2vec, CLOW, and
PENN-DIEM Ensemble


Conf. Training Style    Window Size    Dimensionality
Windowed                10             500
Directional             5              500
Windowed                2              2000
Directional             5              2000
Directional             10             2000
Directional             1              500
DIEM                    X              320

Table 5. Concatenated Model Configurations


Table 4 documents the change in syntactic analogy query
quality as a result of the interpolated DIEM vectors. For
the DIEM experiment, each analogy query was first
performed by running the query on CLOW and DIEM
independently, and selecting the top thousand CLOW cosine
similarities. We summed the squared cosine similarity of
each of these top thousand with each associated cosine
similarity returned by the DIEM and resorted. This was
found to be an efficient estimation of concatenation that
did not reduce quality.
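
The re-ranking step can be sketched as follows, under one reading of the description: the CLOW cosine similarity is squared and added to the DIEM cosine similarity for each of the top CLOW candidates, and the candidates are then sorted again. The function name `rerank` and the score dictionaries are hypothetical; the scores are made-up numbers, not results from the experiments.

```python
def rerank(clow_scores, diem_scores, top_k=1000):
    """Re-sort the top_k CLOW candidates by squared CLOW cosine plus DIEM cosine."""
    top = sorted(clow_scores, key=clow_scores.get, reverse=True)[:top_k]
    combined = {word: clow_scores[word] ** 2 + diem_scores[word] for word in top}
    return sorted(combined, key=combined.get, reverse=True)


# Made-up cosine similarities for a query such as "sees" - "see" + "bank".
clow_scores = {"yard": 0.70, "firestone": 0.65, "banks": 0.60}
diem_scores = {"yard": 0.20, "firestone": 0.10, "banks": 0.97}

ranking = rerank(clow_scores, diem_scores)
ranking_top2 = rerank(clow_scores, diem_scores, top_k=2)
```

In this toy case “banks” moves from third to first place once the DIEM similarity is added. With `top_k=2` it is never considered, which shows that the candidate cut-off is applied on the CLOW scores alone.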

Table 5 documents the parameter selection for a combined
neural network partitioned according to several training
styles and dimensionalities. As in the experiments of Table
3, each analogy query was first performed by running the
query on each model independently, selecting the top
thousand cosine similarities. We summed the cosine
similarity of each of these top thousand entries across all
models (excluding DIEM for semantic queries) and resorted.
(For normalization purposes, DIEM scores were raised to the
power of 10 and all other scores were raised to the power
of 0.1 before summing.)

5.5. High Level Comparisons 


Algorithm     GloVe    Word2Vec          PENN+D
Config        X        CBOW     SG       SG       ENS
Params        X        7.6 B    7.6 B    40 B     59 B
Sem. Dims     300      500      500      5000     7820
Semantic      81.9     81.0     82.2     69.6     82.7
Syntactic     69.3     72.0     71.3     80.0     88.3
Combined      75.0     76.1     76.2     75.3     85.8


Table 6. Scores reflect best published results in each category
(semantic, syntactic, and combined) when parameters are tuned
optimally for each individual category.

Our final results show a lift in quality and size over
previous models with a 58% syntactic lift over the best
published syntactic result, and a 40% overall lift over the
best published overall result (Pennington et al., 2014).
Table 6 also includes the highest word2vec scores we could
achieve through better parameterization (which also exceed
the best published word2vec scores). Within PENN models,
there exists a speed vs. performance tradeoff between
SG-DIEM and CLOW-DIEM. In this case, we achieve a 20x
level of parallelism in SG-DIEM relative to CLOW, with
each model training partitions of 250 dimensions (250 *
20 = 5000 final dimensionality). A 160 billion parameter
network was also trained overnight on 3 multi-core CPUs;
however, it yielded 20000 dimensional vectors for each word
and subsequently overfit the training data. This is because
a dataset of 8 billion tokens with a negative sampling
parameter of 10 has 80 billion training examples. Having
more parameters than training examples overfits a dataset,
whereas a 40 billion parameter network performs at parity
with current state of the art, as shown in Table 6. Future
work will experiment with larger datasets and vocabularies.
The previous largest neural network contained 11.2 billion
parameters (Coates et al., 2013), whereas CLOW and the
largest SG contain 16 billion (trained all together) and 160
billion (trained across a cluster) parameters respectively,
as measured by the number of weights.
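
The arithmetic behind these figures can be checked with a short sketch. The text does not state how the 58% and 40% lifts are computed; the snippet assumes they are relative error reductions measured against the strongest baseline rows in Table 6 (72.0 syntactic, 76.2 combined), a reading that reproduces both numbers but remains an assumption. The function and variable names are illustrative.

```python
def error_reduction(baseline_accuracy, new_accuracy):
    """Fraction of the baseline's remaining error removed by the new model."""
    return (new_accuracy - baseline_accuracy) / (100.0 - baseline_accuracy)


# Assumed baselines: best non-PENN syntactic and combined scores in Table 6.
syntactic_lift = error_reduction(72.0, 88.3)   # roughly 0.58
overall_lift = error_reduction(76.2, 85.8)     # roughly 0.40

# Figures stated directly in the text.
final_dims = 250 * 20                 # 20 parallel partitions of 250 dimensions
training_examples = 8e9 * 10          # 8 billion tokens, 10 negative samples
size_ratio = 160e9 / 11.2e9           # versus the previous largest network
```

Under this reading, the 160 billion parameter network has twice as many weights as there are training examples (80 billion), while the 40 billion parameter network has half as many, which is consistent with the overfitting explanation given above.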

6. Conclusion and Future Work 

Encoding both word and character order in neural word
embeddings is beneficial for word-analogy tasks,
particularly syntactic tasks. These findings are based upon
the intuition that order matters in human language and have
been validated through the methods above. Future work will
further investigate the scalability of these word embeddings
to larger datasets with reduced runtimes.